compareMS2 2.0: An Improved Software for Comparing Tandem Mass Spectrometry Datasets

It has long been known that biological species can be identified from mass spectrometry data alone. Ten years ago, we described a method and software tool, compareMS2, for calculating a distance between sets of tandem mass spectra, as routinely collected in proteomics. This method has seen use in species identification and mixture characterization in food and feed products, as well as other applications. Here, we present the first major update of this software, including a new metric, a graphical user interface and additional functionality. The data have been deposited to ProteomeXchange with dataset identifier PXD034932.


■ INTRODUCTION
A decade ago, Palmblad and Deelder 1 first described a method for molecular phylogenetics based on direct comparison of tandem mass spectra. The method has since seen a range of applications, including food 2,3 and feed 4−7 species identification, quality control, 8 and experimental design. 9 Similar works include the DISMS2 library by Rieder and colleagues 10 and MS1-only methods for "sequence-free" phylogenetics reviewed by Downard. 11 Neely and Palmblad 12 recently placed these methods in a larger historical context, going all the way back to the seminal comparison of separated tryptic peptides across species by Zuckerkandl, Jones, and Pauling in 1960. 13 In this technical note, we describe a significantly updated version of the original compareMS2 software. Improvements include a graphical user interface (GUI) that controls all steps of the analysis and dynamically displays the phylogenetic tree, a fully symmetric distance metric, and many additional filters and output options.

Symmetric Distance Measure
The original compareMS2 compared two sets of tandem mass spectra, e.g., those resulting from liquid chromatography−tandem mass spectrometry, by scanning one set and for each spectrum finding the best match in the other set (within precursor m/z and retention time tolerances). The results depended on which set was scanned, and the distance metric was only approximately symmetric. compareMS2 2.0 has a perfectly symmetric measure of the distance between sets of tandem mass spectra regardless of order of input. In this section, we describe this modified measure and some of its properties.

Comparing Pairs of Spectra
The comparison between sets of tandem mass spectra starts with the comparison of pairs of spectra. There are many measures of spectral similarity. compareMS2 supports the cosine score (dot product) and spectral angle. By default, compareMS2 uses the cosine score, i.e., the cosine of the angle between the vector representations of the spectra, after normalizing both spectra to unit length:

$$s(a,b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \cos\theta \tag{1}$$

where θ is the angle between the vector representations of the two spectra. Equation 1 is symmetric in a and b.
Optionally, compareMS2 can first scale spectra to reduce the influence of very intense peaks, e.g., by taking the square or cube root of all intensities. All peaks below a user-defined or automatically detected relative or absolute background can also be excluded from the similarity calculation.
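The pairwise score can be illustrated with a minimal Python sketch. It is not the compareMS2 implementation (which is written in C): the function names, the 1 m/z-unit bin width, the order of scaling and binning, and the reading of the noise level as an absolute intensity cutoff are all assumptions made for illustration.

```python
import math

def bin_spectrum(peaks, bin_width=1.0, scaling=0.5, noise=0.0):
    """Convert (m/z, intensity) pairs into a sparse unit-length vector.

    Assumptions (not taken from compareMS2): fixed-width m/z bins,
    intensities raised to `scaling` (0.5 = square root) before binning,
    and `noise` treated as an absolute intensity cutoff.
    """
    vec = {}
    for mz, intensity in peaks:
        if intensity <= noise:  # drop peaks at or below the background
            continue
        b = int(mz / bin_width)
        vec[b] = vec.get(b, 0.0) + intensity ** scaling
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return {}
    return {b: v / norm for b, v in vec.items()}

def cosine_score(a, b):
    """Dot product of two unit-length sparse vectors, i.e., cos(theta)."""
    if len(a) > len(b):  # iterate over the shorter vector
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

# Two spectra with the same relative intensities but different absolute scale
spec_a = bin_spectrum([(100.2, 400.0), (200.7, 100.0)])
spec_b = bin_spectrum([(100.4, 4.0), (200.1, 1.0)])
# A spectrum sharing no bins with the other two
spec_c = bin_spectrum([(300.0, 50.0)])
print(cosine_score(spec_a, spec_b), cosine_score(spec_a, spec_c))
```

Because both vectors are normalized before the dot product is taken, the score depends only on relative intensities, which is why spec_a and spec_b above are scored as identical.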
Comparing Sets of Spectra
compareMS2 2.0 defines the similarity between two sets of tandem mass spectra, A and B, as follows. If for a spectrum a∈A we find a spectrum b∈B with s(a,b) greater than or equal to a minimum similarity threshold s_min, we say that a has a similar spectrum in B. We then define a subset S_A ⊂ A, given B, of all spectra in A with at least one similar spectrum in B as

$$S_A = \{\, a \in A \mid \exists\, b \in B : s(a,b) \ge s_{\min} \,\} \tag{2}$$

and a corresponding subset

$$S_B = \{\, b \in B \mid \exists\, a \in A : s(a,b) \ge s_{\min} \,\} \tag{3}$$

We then define a global similarity between sets A ≠ Ø and B ≠ Ø, S(A,B), as the average of the fraction of spectra in A with at least one similar spectrum in B and the fraction of spectra in B with at least one similar spectrum in A:

$$S(A,B) = \frac{1}{2}\left(\frac{|S_A|}{|A|} + \frac{|S_B|}{|B|}\right) \tag{4}$$

where |X| denotes the cardinality, the number of elements, in a set X. Though in some use cases it may be meaningful to define the similarity between two empty sets, i.e., LC-MS/MS datasets without tandem mass spectra, or the similarity between an empty and a non-empty set, we have chosen to leave these undefined and have the compareMS2 output reflect this. We believe this makes sense, as a dataset without tandem mass spectra usually suggests something went wrong during measurement. Values can always be imputed after the compareMS2 runs, and rows with undefined values in the distance matrix can be excluded in subsequent analyses in most phylogenetic software.
From the symmetry of eq 4, we see that S(A,B) = S(B,A). We also note that both terms in eq 4 are non-negative and that neither fraction can exceed 1, therefore

$$0 \le S(A,B) \le 1 \tag{5}$$

The distance between the sets is then defined as

$$D(A,B) = \begin{cases} \dfrac{1}{S(A,B)} - 1, & S(A,B) > 0 \\[2ex] \dfrac{4}{1/|A| + 1/|B|} - 1, & S(A,B) = 0 \end{cases} \tag{6}$$

Since S(A,B) is symmetric, D(A,B) is also symmetric. Note that D(A,B) → ∞ as |A| → ∞ and |B| → ∞ when there are no similar spectra in A and B. In the special case of A and B both containing a single spectrum, D(A,B) is 0 if the spectra are similar and 1 otherwise. The definition of the distance between sets with S(A,B) = 0 corresponds to A and B having a hypothetical half matching spectrum. In most real-world use cases, both A and B would contain thousands of spectra.
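As a hedged illustration of the set similarity and distance, the following Python sketch counts spectra with at least one similar partner and converts the counts into S and D. The function names are hypothetical, the toy `dot` score stands in for eq 1, and two points are assumptions based on our reading of the text rather than on the compareMS2 source: the distance is taken as D = 1/S − 1, and the case S = 0 is handled by pretending each set contains half a matching spectrum.

```python
def dot(a, b):
    """Toy stand-in for the spectral similarity s(a, b) on unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def similar_counts(set_a, set_b, score=dot, s_min=0.8):
    """Return (|S_A|, |S_B|): spectra with at least one partner scoring >= s_min."""
    hit_a = [False] * len(set_a)
    hit_b = [False] * len(set_b)
    for i, a in enumerate(set_a):
        for j, b in enumerate(set_b):
            if score(a, b) >= s_min:
                hit_a[i] = True
                hit_b[j] = True
    return sum(hit_a), sum(hit_b)

def set_similarity(set_a, set_b, score=dot, s_min=0.8):
    """Average of the two matched fractions; None (undefined) if either set is empty."""
    if not set_a or not set_b:
        return None
    n_a, n_b = similar_counts(set_a, set_b, score, s_min)
    return 0.5 * (n_a / len(set_a) + n_b / len(set_b))

def set_distance(set_a, set_b, score=dot, s_min=0.8):
    """Assumed form D = 1/S - 1, with a half-matching-spectrum fallback for S = 0."""
    s = set_similarity(set_a, set_b, score, s_min)
    if s is None:
        return None
    if s == 0.0:
        # Assumption: half a matching spectrum in each set
        s = 0.5 * (0.5 / len(set_a) + 0.5 / len(set_b))
    return 1.0 / s - 1.0

A = [(1.0, 0.0), (0.0, 1.0)]
B = [(1.0, 0.0)]
print(set_similarity(A, B), set_distance(A, B), set_distance(B, A))
```

The nested loop makes the quadratic cost of an exhaustive comparison explicit; restricting the inner loop to candidates within precursor m/z and retention time windows is what makes large datasets tractable.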
Two co-directional spectra (spectra whose vector representations differ only by a factor) are considered identical by s. Therefore, datasets containing perfectly co-directional spectra would have a global similarity S = 1 and distance D = 0. Strictly speaking, D is not a metric in the mathematical sense, as the identity of indiscernibles (D(A,B) = 0 ⇔ A = B) no longer holds after normalizing the spectra. This is by design, as the absolute intensities in a tandem mass spectrum depend not only on the peptide sequence and abundance, but also on the point or points during the chromatographic peak at which the peptide was selected for MS/MS, which is generally not reproducible.
As comparing all tandem mass spectra is computationally expensive, especially for large datasets, compareMS2 allows approximation of D(A,B) by only comparing a spectrum a∈A with those spectra b∈B that fall within user-defined windows of retention time or scan number, and precursor m/z.

Figure 1. compareMS2 2.0 workflow, orchestrated by the graphical user interface. After parsing and checking the input parameters, ensuring all files are present and in the correct format, compareMS2 performs (N² − N)/2 pairwise comparisons of N datasets using the symmetric distance measure described below, or N² − N comparisons if the original measure is used. After each row is completed, compareMS2 updates the (strictly triangular) distance matrix and generates a new tree. This allows the user to monitor progress and terminate and restart the run if necessary. If the original measure is used, compareMS2 by default creates both the strictly upper and lower triangular distance matrices (these can be averaged in phylogenetics software such as MEGA).

Journal of Proteome Research pubs.acs.org/jpr Technical Note

compareMS2 Pipeline
compareMS2 takes as minimum input a directory of MGF files to be compared. We chose MGF as the default input format, as it is convenient for storing MS2-only data and the MGF files can easily be filtered, split or combined, which may be useful in some applications of compareMS2, such as when fractionating samples or removing nonpeptide spectra. Most vendor software as well as msconvert 14 can convert raw data or mzML files to MGF. To provide faster feedback to the user, compareMS2 2.0 interleaves the distance matrix calculations with tree construction, updating and displaying a phylogenetic tree as each row of the distance matrix is completed (Figure 1). With the default symmetric metric, this matrix is triangular, hence the tree is updated rapidly in the beginning: after the first comparison, then again after the next two comparisons, and so on. Version 2.0 also provides additional functionality, such as recording a quality control metric for each dataset (by default the number of tandem mass spectra in the dataset) and a filter to compare only the top-N most intense tandem mass spectra from each dataset. The datasets can be compared in alphabetical, size or random order. By default, compareMS2 outputs a MEGA (.meg) file, but Newick and NEXUS formats are also supported.
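The row-by-row schedule and the resulting number of comparisons can be sketched in a few lines of Python. This is only an illustration of the counting argument; the function names and the example dataset labels are hypothetical and do not reflect how the GUI invokes the command-line tools.

```python
def n_comparisons(n, symmetric=True):
    """Pairwise dataset comparisons: (n^2 - n)/2 if symmetric, n^2 - n otherwise."""
    pairs = (n * n - n) // 2
    return pairs if symmetric else 2 * pairs

def row_schedule(datasets):
    """Yield, row by row, the comparisons that fill a strictly lower triangular matrix."""
    for i in range(1, len(datasets)):
        yield [(datasets[i], datasets[j]) for j in range(i)]

# After each completed row, the distance matrix and tree could be refreshed.
for row in row_schedule(["human", "chimpanzee", "dog", "deer"]):
    print(len(row), "comparison(s), then redraw tree:", row)
```

With four datasets the rows contain one, two and three comparisons, so a first tree is available after a single comparison and a four-taxon tree after six.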

compareMS2 GUI
Technically, compareMS2 2.0 combines two software tools, which can also be run individually on the command line. The first component compares two datasets, e.g., from LC-MS/MS. The second component takes several such comparisons, combines samples from the same biological species, and computes a distance matrix. The graphical user interface (Figure 2) was designed to be simple to use, hiding most of the internal complexity of compareMS2, including the interleaved execution order of the two components (Figure 1).

Source Code and Availability
The compareMS2 source code can be freely downloaded from https://github.com/524D/compareMS2. On Windows, the software can be installed using a simple installer. compareMS2 has been tested on Windows 10, Ubuntu 20.04 Linux and macOS 12. The GUI is based on Electron (https://www.electronjs.org/) and is written in JavaScript, HTML, and CSS. It uses the phylotree.js library 15 to render the graphical tree representation. Conversion of the distance matrix into Newick format uses the UPGMA method and is also implemented in JavaScript. The distance computation and distance matrix creation are performed by two command-line programs written in C. These can be used to run compareMS2 without the GUI. Source code and prebuilt executables of the command-line tools can be found in the external_binaries directory of the compareMS2 repository.
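For readers unfamiliar with the conversion of a distance matrix into a Newick tree, the following is a minimal Python sketch of standard UPGMA (average linkage weighted by cluster size). It is not a port of the JavaScript code in compareMS2; the function name, the full symmetric matrix input and the four-decimal branch lengths are assumptions for illustration, and tie handling may differ from the actual implementation.

```python
import itertools

def upgma_newick(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix; return a Newick string."""
    # Active clusters: id -> (Newick subtree, number of leaves, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i, j in itertools.combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        x, y = min(d, key=d.get)          # closest pair of active clusters
        tx, nx, hx = clusters.pop(x)
        ty, ny, hy = clusters.pop(y)
        h = d.pop((x, y)) / 2.0           # height of the new internal node
        for k in clusters:                # size-weighted average linkage update
            dxk = d.pop((min(x, k), max(x, k)))
            dyk = d.pop((min(y, k), max(y, k)))
            d[(k, next_id)] = (nx * dxk + ny * dyk) / (nx + ny)
        subtree = f"({tx}:{h - hx:.4f},{ty}:{h - hy:.4f})"
        clusters[next_id] = (subtree, nx + ny, h)
        next_id += 1
    (tree, _, _), = clusters.values()
    return tree + ";"

example = upgma_newick(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(example)
```

In this toy example A and B are joined first at height 1, and C joins the pair at height 3, so the internal branch has length 2.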

Experimental Features
As compareMS2 provides a basic framework for comparing tandem mass spectra across datasets, we have begun to add experimental features to help visualize such comparisons. The first of these experimental outputs is a two-dimensional histogram of precursor m/z difference and spectral similarity for all comparisons of spectra between two datasets. These features will only be available on the command line and require additional software such as R to generate figures, but they allow, for example, spectral similarity to be correlated with precursor mass difference. Scripting examples in R are available on https://osf.io/jey28/.
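The idea behind the two-dimensional histogram can be sketched as follows in Python. This is an independent illustration and does not reproduce the compareMS2 output format: the function name, the bin widths and the input tuples are hypothetical.

```python
from collections import Counter

def similarity_histogram(pairs, mz_bin=0.01, score_bin=0.05):
    """Count spectrum pairs per (precursor m/z difference bin, similarity bin).

    `pairs` is an iterable of (precursor_mz_a, precursor_mz_b, similarity).
    Bin widths are illustrative assumptions, not compareMS2 defaults.
    """
    hist = Counter()
    for mz_a, mz_b, s in pairs:
        mz_index = round((mz_a - mz_b) / mz_bin)  # signed m/z difference bin
        score_index = int(s / score_bin)          # similarity bin
        hist[(mz_index, score_index)] += 1
    return hist

toy_pairs = [
    (500.25, 500.25, 0.93),   # same precursor m/z, similar spectra
    (500.25, 500.25, 0.91),
    (500.752, 500.25, 0.93),  # about 0.5 m/z apart, similar spectra
    (600.0, 599.668, 0.12),   # about 1/3 m/z apart, dissimilar spectra
]
hist = similarity_histogram(toy_pairs)
for (mz_index, score_index), count in sorted(hist.items()):
    print(mz_index * 0.01, score_index * 0.05, count)
```

The resulting counts can be written to a table and plotted as a heat map in R or any other plotting environment.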

Testing
To demonstrate the features and performance of compareMS2 2.0, we used previously published amaZon ion trap (Bruker Daltonics) and Orbitrap Fusion Lumos (Thermo Fisher Scientific) data from primate sera and an E. coli lysate. 1,12 In addition, we used new data acquired on the same Orbitrap instrument, as described in ref 12, from California sea lion (Zalophus californianus), dog (Canis lupus familiaris), rock hyrax (Procavia capensis), and white-tailed deer (Odocoileus virginianus) sera. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE 16 partner repository with the dataset identifier PXD034932 and DOI 10.6019/PXD034932. Phylogenetic trees were generated by compareMS2 and MEGA11 17 using default parameters for both (for compareMS2: maximum precursor mass difference 2.05, score cutoff 0.8, minimum basepeak intensity 10000, minimum total ion current 0, maximum retention time difference 60, start retention time 0, end retention time 100000, maximum scan number difference 10000, start scan 1, end scan 1000000, scaling 0.5, noise 10, version of set distance metric 2, version of QC metric 0, compare only the N most intense spectra set to "All", output format "MEGA", and compare order "Smallest-largest first"; for MEGA11: "Lower Left Matrix" and "Pairwise Distance" input data for UPGMA Phylogeny Analysis).

■ RESULTS AND DISCUSSION
The compareMS2 2.0 GUI (Figure 2) displays a phylogenetic tree with a quality metric mapped to a continuous or divergent color gradient, the tree being continuously updated to provide real-time feedback to the user. This allows executions to be paused or terminated at any stage, which may be useful for large jobs. For example, comparing 100 LC-MS/MS datasets requires 4950 pairwise comparisons, taking several hours. However, after only six pairwise comparisons of four datasets, trees can already be quite informative and reveal if there is an issue with the input files or parameters.
Using the five new serum datasets, each containing between 42,629 and 47,626 tandem mass spectra, we could reconstruct the correct phylogenetic tree in compareMS2 and MEGA11 (Figure 3). The 10 pairwise comparisons in compareMS2 took 40 min with default parameters on a PC with an Intel Xeon W-2135 CPU running at 3.70 GHz. The analyses can be accelerated by comparing spectra within a narrower m/z window than the default value of 2.05. Each comparison is independent, so in principle the problem is embarrassingly parallel.
To test one of the experimental features, we compared the similarity between tandem mass spectra as a function of precursor m/z difference for comparisons between two closely related species, human and chimpanzee, as well as between two species with few shared tryptic peptides, human and E. coli (Figure 4). These comparisons reveal information on spectral similarity, but also on mass measurement precision, charge states and isotope errors before and independent of any database search, where such parameters typically have to be provided. In these datasets, charge states up to [M + 6H]⁶⁺ and isotope errors up to at least 3 Da are observed. The analysis can also be used to estimate suitable parameters for compareMS2, e.g., m/z windows and spectral similarity thresholds. We also observe some unexpected side bands most noticeable at 1/2 and 1/3 Da, but not near zero, in the Orbitrap data. These bands are also seen in comparisons of spectra within individual datasets.
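The precursor m/z differences expected from isotope errors can be enumerated with a short Python sketch. It assumes that an isotope error of k neutrons at charge state z shifts the precursor m/z by k/z times the ¹³C−¹²C mass difference (1.0033548 Da); the function name and the default ranges (charge up to 6, error up to 3) are illustrative choices based on the ranges mentioned in the text.

```python
from fractions import Fraction

C13_C12 = 1.0033548  # mass difference between 13C and 12C in Da

def isotope_error_offsets(max_charge=6, max_error=3):
    """Map each distinct ratio k/z to the expected precursor m/z difference."""
    ratios = {Fraction(k, z)
              for z in range(1, max_charge + 1)
              for k in range(1, max_error + 1)}
    return {r: float(r) * C13_C12 for r in sorted(ratios)}

offsets = isotope_error_offsets()
for ratio, delta in offsets.items():
    print(f"{str(ratio):>4}  {delta:.4f}")
```

Different combinations of isotope error and charge state can give the same ratio (for example 1/2 and 2/4), which is why the set of distinct rational numbers is smaller than the number of combinations.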
When combined with posterior error probability estimators such as PeptideProphet 19 or Percolator, 20 spectral similarity measures can in principle be converted into probabilities for any pair of spectra being derived from the same or closely related analytes. When searching spectral libraries, the probability that a query spectrum matches the library spectrum is multiplied by the original probability that the library spectrum was correctly identified to estimate the probability that the query spectrum is correctly matched to a peptide or other analyte. The compareMS2 software uses the spectral similarity in eq 1 to calculate the overlap between sets of tandem mass spectra without regard to their identification to a specific analyte.
Figure 3. The evolutionary history was inferred using the UPGMA method. 18 The optimal tree is shown and drawn to scale, with branch lengths in the same units as those generated by compareMS2 and used to infer the phylogenetic tree. Taxon images are from PhyloPic.

Naively, one may attempt to use something like the Jaccard similarity, J, defined as the cardinality of the intersection divided by the cardinality of the union:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|} \tag{7}$$

However, no two spectra are exactly the same. If the criterion for considering two spectra identical (as in derived from the same peptide) for the purpose of calculating |A ∩ B| and |A ∪ B| is too strict, then one will underestimate |A ∩ B| and overestimate |A ∪ B|. If the criterion is too lax, then one overestimates |A ∩ B| and underestimates |A ∪ B|. In either case, the two errors compound, making the Jaccard similarity very sensitive to the precise definition of when two spectra are considered identical. Even more problematic is the intransitive nature of this identity, which is exacerbated by chimeric spectra (spectra that are superpositions of two or more peptide tandem mass spectra). Briefly, a pure spectrum from peptide P can be considered identical to a chimeric spectrum with a small contribution from a second, cofragmenting peptide Q, which in turn is identical to a chimeric spectrum with a slightly larger contribution from peptide Q, and so on, eventually ending up with the pure spectrum of peptide Q, which can be very different from the original spectrum from peptide P, just like messages in a game of telephone. This is why exercises clustering large numbers of tandem mass spectra based on spectral similarity tend to produce large globs of spectra rather than a distinct cluster for each peptide.
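The intransitivity of spectral identity in the presence of chimeric spectra can be made concrete with a toy calculation in Python. It rests on an idealization that is ours, not the source's: peptides P and Q are assumed to share no fragment peaks, so their pure spectra are orthogonal unit vectors and each chimeric spectrum is a normalized mixture of the two. The function names and the 25% mixing steps are hypothetical.

```python
import math

def mixture(frac_q):
    """Unit-length chimeric spectrum: (1 - frac_q) of peptide P plus frac_q of peptide Q."""
    p, q = 1.0 - frac_q, frac_q
    norm = math.hypot(p, q)
    return (p / norm, q / norm)

def cosine(a, b):
    """Cosine score of two unit-length two-component spectra."""
    return a[0] * b[0] + a[1] * b[1]

# Pure P, then 25%, 50% and 75% contribution from Q, then pure Q
chain = [mixture(k / 4) for k in range(5)]
steps = [cosine(chain[k], chain[k + 1]) for k in range(4)]
ends = cosine(chain[0], chain[-1])
print([round(s, 3) for s in steps], ends)
```

Every neighboring pair in the chain scores above a cutoff of 0.8, yet the two pure spectra at the ends score zero, so any clustering that links spectra through pairwise similarity would merge P and Q into one group.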
Figure 4. Similarity of tandem mass spectra as a function of precursor m/z difference in Orbitrap Fusion Lumos (A,B) and amaZon ion trap data (C,D), comparing similar (human and chimpanzee sera) and dissimilar (human serum and E. coli) samples. Panels A and B compare two LC-MS/MS runs, and panels C and D compare four runs per species (16 comparisons). Similar spectra have precursor m/z differences near zero or near a rational number corresponding to the isotope error at a specific charge state (shown more clearly in panel E, generated from 8 Orbitrap human serum datasets).

■ CONCLUSIONS
compareMS2 compares sets of tandem mass spectra to each other rather than to predicted spectra of specific peptides, as is done when identifying proteins from tandem mass spectra. We have used examples from molecular phylogenetics, but many other uses have been demonstrated, including food and feed identification, mixture analysis and experimental design. compareMS2 may also be used for data quality control, comparing large numbers of datasets prior to database search and protein quantitation to detect outliers and possible batch effects. The visualization of spectral similarity as a function of precursor mass difference gives another window into the data, and can suggest parameters for database searches a priori. We make compareMS2 freely available as open source and provide an automatic installer for Microsoft Windows in the hope that it may be as useful to others as it has been for us.